class: center, middle, inverse, title-slide

.title[
# Regression: Understanding Relationships Between Variables
📊
]
.author[
### S. Mason Garrison
]

---
layout: true

<div class="my-footer">
<span>
<a href="https://psychmethods.github.io/coursenotes/" target="_blank">Methods in Psychological Research</a>
</span>
</div>

---
class: middle

# Regression: Understanding Relationships Between Variables

---
# What is Regression?

- In all the bivariate statistics so far, we have described how variables move together.
- But we have not yet used those variables as a system with one
  - DV (dependent variable); and one
  - IV (independent variable)
- Regression is a statistical method that allows us to understand how the DV changes as the IV changes.

---
# Example: Smoking and Lung Capacity

.pull-left[
- If you recall, we previously used data on smoking and lung capacity.
]

.pull-right[
<table class="table table-striped table-hover" style="color: black; margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:right;"> cigarettes </th>
   <th style="text-align:right;"> lung_capacity </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:right;"> 0 </td>
   <td style="text-align:right;"> 45 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 5 </td>
   <td style="text-align:right;"> 42 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 10 </td>
   <td style="text-align:right;"> 33 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 15 </td>
   <td style="text-align:right;"> 31 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 20 </td>
   <td style="text-align:right;"> 29 </td>
  </tr>
</tbody>
</table>
]

---
# Plotting Smoking and Lung Capacity

.pull-left[
- When we plotted these data, we found a negative linear association between the number of cigarettes smoked and lung capacity.
]

.pull-right[
<img src="data:image/png;base64,#12_Regression_files/figure-html/scatterplot-1.png" width="90%" style="display: block; margin: auto;" />
]

---
# Interpreting Smoking and Lung Capacity

- Further, we found that the correlation between lung capacity and cigarette consumption was -0.962.
- Squaring this correlation gives `\(r^2 = 0.924\)`: 92.4% of the variation in lung capacity can be explained by cigarette consumption.
- This high percentage means that when we know how many cigarettes someone consumes, we can predict their lung capacity quite accurately.
- The higher the correlation (and, by extension, the variance explained), the more accurate our predictions *ought* to be.

---
# The Regression Line

---
# What is a Regression Line?

- In simple language, we draw a line through the points in a scatter plot.
- In more sophisticated language, we fit a linear model of the relationship between `\(x\)` and `\(y\)`.
- A regression line is a straight line that describes how a response variable `\(y\)` changes as an explanatory variable `\(x\)` changes.
- We often use a regression line to predict the value of `\(y\)` for a given value of `\(x\)`,
  - when we believe the relationship between `\(x\)` and `\(y\)` is linear.

---
# Equation of a Line

- We can describe this line with a familiar equation.
- Suppose that `\(y\)` is a response variable (plotted on the vertical axis) and `\(x\)` is an explanatory variable (plotted on the horizontal axis).
- A straight line relating `\(x\)` to `\(y\)` has an equation of the form `\(y = a + bx\)` (or `\(y = mx + b\)`).

---
# Equation of a Line

`\(y = a + bx\)` (or `\(y = mx + b\)`)

- In this equation,
  - `\(b\)` (or `\(m\)`) is the slope
      - the amount by which `\(y\)` changes when `\(x\)` increases by one unit.
  - The number `\(a\)` (or `\(b\)` in the `\(y = mx + b\)` form) is the intercept
      - the value of `\(y\)` when `\(x = 0\)`.
- With those two numbers, any straight line can be defined
  - (with one exception: lines parallel to the `\(y\)` axis)
  - within the Cartesian plane.

---
# Plot Line with changing positive slopes

.pull-left[
- Here, we demonstrate lines with different positive slopes.
- A higher slope means a steeper line.
]

.pull-right[
<img src="data:image/png;base64,#12_Regression_files/figure-html/unnamed-chunk-3-1.png" width="90%" style="display: block; margin: auto;" />
]

---
# Plot Line with changing negative slopes

.pull-left[
- In this plot, we demonstrate lines with different negative slopes.
- Negative slopes show a downward trend as `\(x\)` increases.
]

.pull-right[
<img src="data:image/png;base64,#12_Regression_files/figure-html/unnamed-chunk-4-1.png" width="90%" style="display: block; margin: auto;" />
]

---
# Plot Line with changing positive intercepts

.pull-left[
- This plot demonstrates how changes in the intercept shift the line vertically.
- A higher intercept means the line starts higher on the y-axis.
]

.pull-right[
<img src="data:image/png;base64,#12_Regression_files/figure-html/unnamed-chunk-5-1.png" width="90%" style="display: block; margin: auto;" />
]

---
# Plot Line with changing negative intercepts

.pull-left[
- In this plot, we explore lines with negative intercepts, shifting the line down on the y-axis.
- A more negative intercept places the line lower on the y-axis.
]

.pull-right[
<img src="data:image/png;base64,#12_Regression_files/figure-html/unnamed-chunk-6-1.png" width="90%" style="display: block; margin: auto;" />
]

---
class: center, middle

# Least Squares Method

---
# Objective: Find the "best-fitting" line

- When we have a set of data points, we often want to find a line that best fits the data.
- To eliminate the subjectivity of creating a line to fit the data,
  - we need an objective way to draw the line.
- Several methods exist, but the most popular is the least-squares approach.
- The least-squares regression line of y on x is the line that makes the sum of the squares of the vertical distances of the data points from the line as small as possible.
<img src="data:image/png;base64,#12_Regression_files/figure-html/unnamed-chunk-7-1.png" width="90%" style="display: block; margin: auto;" />

---
# Goal of Least Squares

- The least-squares regression line of `\(y\)` on `\(x\)` is the line that makes
  - the sum of the squares of the vertical distances of the data points from the line as small as possible.
- The goal is to minimize the difference between actual and predicted values of the dependent variable `\(y\)`
  - `\(\min\left(\sum\limits^{n}_{i=1}(y_{i}-(a+bx_{i}))^{2}\right)\)`
  - `\(\min\left(\sum\limits^{n}_{i=1}(y_{i}-\widehat{y}_{i})^{2}\right)\)`
  - `\(\min\left(\sum\limits^{n}_{i=1}(e_{i})^{2}\right)\)`
- Sometimes the principle of least squares is described as minimizing the sum of the:
  - squares,
  - squared residuals, or
  - squared errors.

---
# Least Squares Formulas

- We have data on an explanatory variable `\(x\)` and a response variable `\(y\)` for `\(n\)` individuals.
- From the data, calculate the means `\(\bar{x}\)` and `\(\bar{y}\)`
  - and the standard deviations `\(s_{x}\)` and `\(s_{y}\)` of the two variables and
  - their correlation `\(r\)`.
- The least-squares regression line is the line: `\(\widehat{y}= a + bx\)`
  - with slope: `\(b=r \frac{s_{y}}{s_{x}}\)`
  - and intercept: `\(a=\bar{y}-b\bar{x}\)`

---
# Calculating Regression Line for Smoking Data: Slope

- Using our smoking data, we can estimate the slope and intercept
- slope:
  - `\(b=r \frac{s_{y}}{s_{x}}\)`
  - `\(b=r_{\text{lung capacity},\text{cigarettes}} \frac{s_{\text{lung capacity}}}{s_{\text{cigarettes}}}\)`
  - `\(b = -0.962 \times \frac{7.071}{7.906}\)`
  - `\(b = -0.962 \times 0.894\)`
  - `\(b = -0.86\)`

---
# Calculating Regression Line for Smoking Data: Intercept

- Using our smoking data, we can estimate the slope and intercept
- intercept:
  - `\(a=\bar{y}-b\bar{x}\)`
  - `\(a=\overline{\text{lung capacity}}-b\,\overline{\text{cigarettes}}\)`
  - `\(a=\overline{\text{lung capacity}}-(-0.86)\,\overline{\text{cigarettes}}\)`
  - `\(a = 36 - (-0.86 \times 10)\)`
  - `\(a = 36 + 8.6\)`
  - `\(a = 44.6\)`

---
# Combined Equation

`\(\widehat{y} = 44.6 - 0.86x\)`

- Interpretation
  - Intercept: 44.6
  - Slope: -0.86

---
# Interpreting the Regression Equation

- The intercept makes a prediction for the `\(y\)` outcome when `\(x\)` is 0.
  - Here, that means that the expected/predicted lung capacity for a non-smoker is 44.6.
- The slope gives us the predicted change in the outcome for a 1-unit increase in `\(x\)`.
  - For every 1 additional cigarette, we would expect a 0.86-unit decline in lung capacity.

---
# Using the Regression Equation to make predictions

- The intercept makes a prediction for the `\(y\)` outcome when `\(x\)` is 0.
- If we want to predict lung capacity for a 5-cigarette smoker, we use the regression equation to predict `\(y\)`.
`\(\widehat{y} = 44.6 - 0.86x\)`

`\(\widehat{y} = 44.6 - 0.86 \times 5\)`

`\(\widehat{y} = 44.6 - 4.3\)`

`\(\widehat{y} = 40.3\)`

Given our equation, we would predict that a 5-cigarette smoker would have a lung capacity of 40.3.

---
# Regression Notes

- Along the regression line, a change of 1 standard deviation in `\(x\)` corresponds to a change of `\(r\)` standard deviations in `\(y\)`.
- The least-squares regression line always passes through `\((\bar{x}, \bar{y})\)` and `\((0, a)\)` on the graph of `\(y\)` against `\(x\)`.

---
# Alternative Line Estimates

- Least Absolute Deviations (LAD) Regression
  - Minimizes the sum of the absolute deviations: `\(\min\left(\sum\limits^{n}_{i=1}\lvert y_{i}-\widehat{y_{i}}\rvert\right)\)`
- LMS
  - Least Median of Squares
- Ridge Regression
- Maximum Likelihood Methods

---
# Two Regression Lines

- The distinction between explanatory variables (`\(x\)`) and response variables (`\(y\)`) is essential in regression.

`\(\widehat{y}=a+b_{yx}x\)`

`\(\widehat{x}=a+b_{xy}y\)`

`\(b_{yx} \neq b_{xy}\)`

- These `\(b\)` coefficients are *not* the same
- The equation is not symmetric

---
# Relation between regression and correlation

- There is a close connection between correlation and the slope of the least-squares line. The slope is
  - `\(b_{y,x}=r_{xy} \frac{s_{y}}{s_{x}}\)`
- The slope `\(b\)` and correlation `\(r\)` always have the same sign.
- If you standardize `\(x\)` and `\(y\)`:
  - `\(b_{z_{y},z_{x}}=r_{xy}\)`
  - `\(a = 0\)`
  - geometrically, standardizing shifts the origin to the mean, and the `\(x\)` and `\(y\)` axes are stretched so that `\(sd = 1\)`
  - `\(\widehat{z_{y}} = r_{xy}z_{x}\)`

---
# Residuals and Residual Plots

- A residual is the difference between an observed value of the response variable and the value predicted by the regression line.
- That is, a residual is the prediction error that remains after we have chosen the regression line:
  - residual = observed `\(y\)` - predicted `\(y\)`
  - residual = `\(y\)` - `\(\widehat{y}\)`
- Residuals represent "leftover" variation in the response after fitting the regression line.
- The residuals from the least-squares line have a special property:
  - the mean of the least-squares residuals is always zero

---
# Residual Calculation

.small[

```r
# Recall
(smoke_regression=lm(lung_capacity~cigarettes,data=smoking))
```

```
## 
## Call:
## lm(formula = lung_capacity ~ cigarettes, data = smoking)
## 
## Coefficients:
## (Intercept)   cigarettes
##       44.60        -0.86
```

```r
prediction <- data.frame(cigarettes = 5)
# Predicted Lung Capacity for 5 Cigarettes
(yhat=predict(smoke_regression,prediction))
```

```
##    1 
## 40.3
```

```r
# Actual Lung Capacity for 5
(yact= smoking$lung_capacity[smoking$cigarettes==5])
```

```
## [1] 42
```

```r
# Difference is the Residual
yact-yhat
```

```
##   1 
## 1.7
```

```r
# Get the Residuals for all the values
(smoke_regression.resid=resid(smoke_regression))
```

```
##    1    2    3    4    5 
##  0.4  1.7 -3.0 -0.7  1.6
```

```r
# Plotting
plot(smoking$cigarettes, smoke_regression.resid,
     ylab="Residuals", xlab="Cigarettes",
     main="Residual Plot of Smoking Data")
abline(0, 0)
```

<img src="data:image/png;base64,#12_Regression_files/figure-html/unnamed-chunk-8-1.png" width="90%" style="display: block; margin: auto;" />
]

---
# Residual Plots

- Plot the residual `\((y-\widehat{y})\)` against the `\(x\)` value

.pull-left[
- A good residual plot creates a flat pattern
  - The residuals are randomly scattered around the line `\(y=0\)`

```r
set.seed(5)
var_x=rnorm(1000)
var_y=var_x*.5+rnorm(1000,1,.1)
regressionxy=lm(var_y~var_x)
regressionxy.resid=resid(regressionxy)
# Plotting
plot(var_x, regressionxy.resid,
     ylab="Residuals", xlab="X",
     main="Good Residual Plot")
abline(0, 0)
```

<img src="data:image/png;base64,#12_Regression_files/figure-html/unnamed-chunk-9-1.png"
width="90%" style="display: block; margin: auto;" />
]

--

.pull-right[
- A concerning residual plot shows a relationship

```r
eruption.lm = lm(eruptions ~ waiting, data=faithful)
eruption.res = resid(eruption.lm)
plot(faithful$waiting, eruption.res,
     ylab="Residuals", xlab="Waiting Time",
     main="Old Faithful Eruptions")
abline(0, 0)
```

<img src="data:image/png;base64,#12_Regression_files/figure-html/unnamed-chunk-10-1.png" width="90%" style="display: block; margin: auto;" />
]

---
# Beyond two variables

- The residual plot for Old Faithful eruptions suggests that there are other variables at work.
- Regression can be used to predict `\(y\)` from multiple `\(x\)`s.
  - However, that is beyond the scope of our class.

---
# Wrapping up...
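
---
# Appendix: Checking Our Work in R

As a sanity check on the hand calculations in these slides, the sketch below recomputes the slope, intercept, and the 5-cigarette prediction from the least-squares formulas, and compares them with `lm()`. The five-row `smoking` data frame is re-created here so the chunk runs on its own (it does not assume the earlier chunks have been run).

```r
# Rebuild the five-row smoking data set used throughout these slides
smoking <- data.frame(
  cigarettes    = c(0, 5, 10, 15, 20),
  lung_capacity = c(45, 42, 33, 31, 29)
)

# Slope and intercept from the least-squares formulas
r <- cor(smoking$cigarettes, smoking$lung_capacity)                 # r = -0.962 (rounded)
b <- r * sd(smoking$lung_capacity) / sd(smoking$cigarettes)         # b = r * (s_y / s_x) = -0.86
a <- mean(smoking$lung_capacity) - b * mean(smoking$cigarettes)     # a = ybar - b * xbar = 44.6

# The same coefficients, recovered by lm()
coef(lm(lung_capacity ~ cigarettes, data = smoking))                # intercept 44.60, slope -0.86

# Predicted lung capacity for a 5-cigarette smoker
a + b * 5                                                           # 40.3
```

The formula-based values and the `lm()` coefficients agree exactly, which is the point of the least-squares formulas: `lm()` is solving the same minimization we worked through by hand.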